Sparse partial least squares classification for high dimensional data.
Authors
Abstract
Partial least squares (PLS) is a well-known dimension reduction method which has recently been adapted for high dimensional classification problems in genome biology. We develop sparse versions of two recently proposed PLS-based classification methods using sparse partial least squares (SPLS). These sparse versions aim to achieve variable selection and dimension reduction simultaneously. We consider both binary and multicategory classification. We provide analytical and simulation-based insights into the variable selection properties of these approaches and benchmark them on well-known publicly available datasets that involve tumor classification with high dimensional gene expression data. We show that incorporating SPLS into a generalized linear model (GLM) framework provides higher sensitivity in variable selection for multicategory classification with unbalanced sample sizes between classes. As the sample size increases, the two-stage approach provides comparable sensitivity with better specificity in variable selection. In binary classification and in multicategory classification with balanced sample sizes, the two-stage approach provides variable selection and prediction accuracy comparable to the GLM version and is computationally more efficient.
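The abstract contrasts a GLM-based SPLS formulation with a two-stage approach (SPLS dimension reduction followed by a conventional classifier). The sketch below is only a minimal illustration of that two-stage pipeline structure, not the authors' SPLS/SGPLS implementation: it uses scikit-learn's ordinary (non-sparse) PLSRegression and LogisticRegression as stand-ins, on simulated data, with illustrative choices of sample size, dimension, and number of components. Unlike SPLS, the ordinary PLS step here performs no variable selection, so the sparsity aspect of the paper's methods is not reproduced.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 1000                                   # p >> n, as in gene expression data
X = rng.normal(size=(n, p))
# binary labels driven by only the first 5 variables (illustrative simulation)
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

# Stage 1: compress X onto a few latent components; Stage 2: classify the scores.
two_stage = Pipeline([
    ("scale", StandardScaler()),
    ("pls", PLSRegression(n_components=3)),        # stand-in for the SPLS reduction step
    ("clf", LogisticRegression()),
])
two_stage.fit(X, y)
print("training accuracy:", two_stage.score(X, y))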
Similar articles
High dimensional classification with combined adaptive sparse PLS and logistic regression
Motivation: The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which combined constitute a powerful framework for classification, as well as data visualization and interpretation. However, currently proposed combinations le...
Sparse partial least squares regression for simultaneous dimension reduction and variable selection
Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very lar...
Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease
Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...
A note on between-group PCA
In the context of binary classification with continuous predictors, we prove two properties concerning the connections between Partial Least Squares (PLS) dimension reduction and between-group PCA, and between linear discriminant analysis and between-group PCA. Such methods are of great interest for the analysis of high-dimensional data with continuous predictors, such as microarray gene expre...
Radial Basis Function-Sparse Partial Least Squares for Application to Brain Imaging Data
Magnetic resonance imaging (MRI) data is an invaluable tool in brain morphology research. Here, we propose a novel statistical method for investigating the relationship between clinical characteristics and brain morphology based on three-dimensional MRI data via radial basis function-sparse partial least squares (RBF-sPLS). Our data consisted of MRI image intensities for multimillion voxels in ...
Journal: Statistical Applications in Genetics and Molecular Biology
Volume: 9, Issue: -
Pages: -
Publication date: 2010